autodiscovery: advanced auto-config discovery via Python discover() bridge by vitkyrka · Pull Request #50199 · DataDog/datadog-agent

vitkyrka · 2026-04-30T16:14:35Z

Summary

Generalises the krakend experiment into a reusable advanced-autoconfig path: rather than hard-coding an OpenMetrics prober and a %%discovered_port%% template variable in Go, the Agent now hands the probe decision to a Python discover(cls, service) classmethod on the integration's check class via a new rtloader bridge. The Python side decides what (if anything) to schedule and returns the fully-resolved instance configs back across the boundary.

This replaces the krakend-specific Go prober from the earlier revision of this PR with infrastructure any integration can opt into by:

Shipping an auto_conf_discovery.yaml (ad_identifiers: + discovery: {} presence marker, plus an instances: template that the Python side may override).
Adding a discover(cls, service) classmethod on its check class that returns list[dict] (or None on no match).
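
As a rough illustration of the opt-in contract above, the following minimal sketch shows a check class with a discover(cls, service) classmethod that returns list[dict] on a match and None otherwise. The Service fields, the ExampleCheck class, the port number, and the openmetrics_endpoint key are all assumptions made for illustration; the real Service dataclass and the krakend implementation live in the companion integrations-core PR and may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Service:
    """Hypothetical stand-in for the service object handed to discover()."""
    id: str
    host: str
    ports: list = field(default_factory=list)


class ExampleCheck:
    """Hypothetical integration check class that opts into discovery."""

    METRICS_PORT = 9090  # assumed port, for illustration only

    @classmethod
    def discover(cls, service: Service) -> Optional[list]:
        # Returning None signals "no match": nothing should be scheduled.
        if cls.METRICS_PORT not in service.ports:
            return None
        # Return fully resolved instance configs (no template variables).
        endpoint = f"http://{service.host}:{cls.METRICS_PORT}/metrics"
        return [{"openmetrics_endpoint": endpoint}]
```

A real discover() would normally probe the endpoint before returning a config; this sketch only inspects the declared ports to keep the example self-contained.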

Tracks Confluence ticket DSCVR/6650004331.

Companion PR (Plan A helpers in datadog_checks_base.utils.discovery, the _run_discover Python bridge helper, and the krakend discover() migration): DataDog/integrations-core#23547 (branch vitkyrka/disco-autoconfig).

Implementation plan: docs/superpowers/plans/2026-05-06-discover-agent-bridge.md.

What's in this PR

New file format (auto_conf_discovery.yaml) — picked up by the existing file config provider (comp/core/autodiscovery/providers/config_reader.go). A non-nil discovery: block on integration.Config is the presence marker that this is a discovery template; the per-integration logic lives entirely on the Python side.

comp/core/autodiscovery/discoverer/ package — Go orchestration:

Discoverer / Bridge interfaces, decoupled from rtloader for testability.
defaultDiscoverer marshals the matched listeners.Service to JSON, calls the bridge, and converts the returned list-of-dicts into integration.Config values to schedule.
Cache keyed by (serviceID, integration_name); successes pinned, failures expire after 30s.
ErrPythonNotReady is treated as transient and not cached, so the next AD reconcile retries instead of sitting on a stale failure for the TTL.
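
The cache rules listed above (successes pinned, failures expiring after 30s, ErrPythonNotReady never cached) can be sketched as follows. The actual cache is Go code in comp/core/autodiscovery/discoverer/; this Python version is only a model of the described behaviour, and the class, method, and exception names are hypothetical.

```python
import time


class PythonNotReady(Exception):
    """Stand-in for the Go ErrPythonNotReady sentinel (name is illustrative)."""


class DiscoveryCache:
    """Model of a cache keyed by (service_id, integration_name)."""

    def __init__(self, failure_ttl=30.0, clock=time.monotonic):
        self._entries = {}
        self._failure_ttl = failure_ttl
        self._clock = clock

    def get(self, service_id, integration):
        key = (service_id, integration)
        entry = self._entries.get(key)
        if entry is None:
            return None
        ok, value, stored_at = entry
        # Successes are pinned; only failures are subject to the TTL.
        if not ok and self._clock() - stored_at >= self._failure_ttl:
            del self._entries[key]
            return None
        return ok, value

    def put_success(self, service_id, integration, configs):
        self._entries[(service_id, integration)] = (True, configs, self._clock())

    def put_failure(self, service_id, integration, error):
        # A "Python not ready" error is transient, so it is never stored and
        # the next reconcile retries immediately.
        if isinstance(error, PythonNotReady):
            return
        self._entries[(service_id, integration)] = (False, error, self._clock())
```

Injecting the clock keeps the expiry logic testable without sleeping.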

rtloader run_discover bridge — new pure-virtual RtLoader::runDiscover, Three::runDiscover implementation (rtloader/three/three.cpp), C export in rtloader/rtloader/api.cpp, and the cgo wrapper pkg/collector/python/discover.go. The bridge calls datadog_checks.base.utils.discovery._run_discover(check_class, service_json) which builds a Service dataclass, invokes cls.discover(service), and returns the JSON-encoded result.
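
The Python end of the bridge, as described above, takes a check class and a JSON string, builds a Service, calls cls.discover(service), and returns JSON. A minimal sketch of that round trip follows. It is not the real _run_discover helper: the function name run_discover_sketch, the Service fields, and the choice to ignore unknown JSON keys are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Service:
    # Hypothetical fields; the real Service dataclass may differ.
    id: str = ""
    host: str = ""
    ports: list = field(default_factory=list)


def run_discover_sketch(check_class, service_json: str) -> str:
    """Decode the service, call discover(), and JSON-encode the result."""
    payload = json.loads(service_json)
    # Keep only keys the dataclass knows about (an assumption of this sketch).
    known = {k: v for k, v in payload.items() if k in Service.__dataclass_fields__}
    service = Service(**known)
    result = check_class.discover(service)
    # A None result (no match) is encoded as JSON null.
    return json.dumps(result)
```

Passing plain JSON strings in both directions keeps the C++ and cgo layers free of any knowledge about the shape of the service or the returned instances.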

Lazy Python init from the bridge — mirrors the python check loader's existing pythonOnce.Do(InitPython) convention. Fixes the AD-vs-Python startup race for both the running agent and the agent check CLI subcommand without the rescan-on-ready plumbing that an earlier iteration of this branch carried (and that has since been reverted — see commits 7a95910 then 4c09170).
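
The lazy-init convention can be modelled as two consumers sharing one run-once guard, so whichever needs Python first triggers initialisation and the other sees a no-op. The sketch below is a Python analogue of Go's sync.Once, not the Agent's code; Once, init_python, load_check, and run_discover_bridge are illustrative names.

```python
import threading


class Once:
    """Small analogue of Go's sync.Once: run a function at most once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False

    def do(self, fn):
        with self._lock:
            if not self._done:
                try:
                    fn()
                finally:
                    # Like sync.Once, count the call as done even if fn raised.
                    self._done = True


python_once = Once()
init_calls = []


def init_python():
    # Placeholder for the expensive interpreter initialisation.
    init_calls.append("init")


def load_check():
    python_once.do(init_python)  # the check loader initialises on demand
    return "check loaded"


def run_discover_bridge():
    python_once.do(init_python)  # the discoverer does the same
    return "discover ran"
```

Because both paths go through the same guard, no ordering between autodiscovery and the check loader has to be enforced at startup.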

AD reconcile path — configmgr runs the discoverer before configresolver.Resolve whenever a template's Discovery field is set; on no-match the check is not scheduled (logged at DEBUG); on match the resolved instances are scheduled directly without going through any template-variable substitution.
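
The branching described above can be sketched as a small function. This is a model of the control flow only, not the configmgr code: the dict-based template, the discover and resolve callables, and the choice to emit a single config carrying all returned instances are assumptions of this example.

```python
def reconcile(template, service, discover, resolve):
    """Return the configs to schedule for one (template, service) pair."""
    # An empty mapping (discovery: {}) still counts as "set", so test for
    # None rather than truthiness.
    if template.get("discovery") is None:
        # Ordinary template: go through template-variable resolution.
        return [resolve(template, service)]

    instances = discover(template["name"], service)
    if not instances:
        # No match: the check is not scheduled.
        return []

    # Match: schedule the resolved instances directly, skipping resolve().
    return [{"name": template["name"], "instances": instances}]
```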

Removed (vs. earlier revision of this PR) — the Go OpenMetrics prober, the serviceWithProbeResult wrapper, and the %%discovered_port%% template variable. The probe logic and any port hint handling now live in Python (krakend's discover() uses the http_probe + is_prometheus_exposition helpers from datadog_checks_base).

dev/e2e tooling

tasks/discovery_dev.py + test/dockerfiles/discovery-dev/ — dda inv discovery-dev.build-image produces an agent image with the dev tree bind-mounted, with a guard that fails fast when dda inv agent.build has re-linked rtloader against the host's libpython (the container ships Python 3.13).
docs/superpowers/2026-05-06-discover-e2e-smoke.md — manual smoke procedure (full build + bind-mount sequence) used to validate end-to-end against a real krakend container; intended as the basis for an automated harness.

Test plan

dda inv test --targets=./comp/core/autodiscovery/...,./pkg/collector/python — unit tests pass (discoverer with fake bridge, cache, providers, integration config).
dda inv linter.go — clean on touched packages.
bazel build //rtloader/... — C++ bridge builds; Three::runDiscover exercised through agent build.
End-to-end smoke against a real krakend:2.10 container per docs/superpowers/2026-05-06-discover-e2e-smoke.md: agent comes up, lazy-init triggers ~6s in, krakend check goes [OK] with 84 metrics/run sourcing http://<container-ip>:9090/metrics from the Python discover() result.
No-Python build path (e.g. cluster-agent): python_bridge_nopython.go stub keeps discoverer.New(nil) compiling and resolves discovery templates fail-closed.

Known limitation (carried forward)

The discoverer call still runs while the configManager mutex is held, which serialises service reconciliation for as long as Python is running. This is acceptable for the experiment, but the call should move outside the lock (or become asynchronous) before adoption broadens.

🤖 Generated with Claude Code

dd-octo-sts · 2026-04-30T16:23:44Z

Go Package Import Differences

Baseline: 80e785f
Comparison: 15e6784

binary	os	arch	change
agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
agent	windows	amd64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
agent	darwin	amd64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
agent	darwin	arm64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
agent	aix	ppc64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
iot-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
iot-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
heroku-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
cluster-agent	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
cluster-agent	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
cluster-agent-cloudfoundry	linux	amd64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer
cluster-agent-cloudfoundry	linux	arm64	+1, -0 +github.com/DataDog/datadog-agent/comp/core/autodiscovery/discoverer

dd-octo-sts · 2026-04-30T16:43:42Z

Files inventory check summary

File checks results against ancestor 80e785f4:

Results for datadog-agent_7.80.0~devel.git.470.15e6784.pipeline.111287371-1_amd64.deb:

No change detected

dd-octo-sts · 2026-04-30T16:51:57Z

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 80e785f
📊 Static Quality Gates Dashboard
🔗 SQG Job

Successful checks

Info

	Quality gate	Change	Size (prev → curr → max)
✅	agent_deb_amd64	+38.45 KiB (0.01% increase)	740.927 → 740.965 → 750.310
✅	agent_deb_amd64_fips	+38.49 KiB (0.01% increase)	699.115 → 699.153 → 702.690
✅	agent_heroku_amd64	+36.78 KiB (0.01% increase)	309.069 → 309.105 → 313.960
✅	agent_msi	+36.05 KiB (0.01% increase)	607.493 → 607.529 → 623.540
✅	agent_rpm_amd64	+38.45 KiB (0.01% increase)	740.911 → 740.949 → 750.280
✅	agent_rpm_amd64_fips	+38.49 KiB (0.01% increase)	699.099 → 699.137 → 702.670
✅	agent_rpm_arm64	+28.13 KiB (0.00% increase)	718.991 → 719.018 → 724.050
✅	agent_rpm_arm64_fips	+32.16 KiB (0.00% increase)	680.266 → 680.297 → 684.460
✅	agent_suse_amd64	+38.45 KiB (0.01% increase)	740.911 → 740.949 → 750.280
✅	agent_suse_amd64_fips	+38.49 KiB (0.01% increase)	699.099 → 699.137 → 702.670
✅	agent_suse_arm64	+28.13 KiB (0.00% increase)	718.991 → 719.018 → 724.050
✅	agent_suse_arm64_fips	+32.16 KiB (0.00% increase)	680.266 → 680.297 → 684.460
✅	docker_agent_amd64	+40.11 KiB (0.00% increase)	801.303 → 801.342 → 805.870
✅	docker_agent_arm64	+28.13 KiB (0.00% increase)	804.212 → 804.239 → 809.730
✅	docker_agent_jmx_amd64	+40.12 KiB (0.00% increase)	992.223 → 992.262 → 996.590
✅	docker_agent_jmx_arm64	+28.13 KiB (0.00% increase)	983.910 → 983.938 → 989.410
✅	docker_cluster_agent_amd64	+28.03 KiB (0.01% increase)	206.583 → 206.610 → 207.600
✅	docker_host_profiler_amd64	+3.19 KiB (0.00% increase)	301.103 → 301.106 → 315.800
✅	docker_host_profiler_arm64	+3.41 KiB (0.00% increase)	312.616 → 312.619 → 327.400
✅	iot_agent_deb_amd64	+28.03 KiB (0.06% increase)	44.454 → 44.482 → 44.970
✅	iot_agent_deb_arm64	+20.03 KiB (0.05% increase)	41.439 → 41.458 → 42.560
✅	iot_agent_deb_armhf	+20.02 KiB (0.05% increase)	42.175 → 42.194 → 42.740
✅	iot_agent_rpm_amd64	+28.03 KiB (0.06% increase)	44.455 → 44.482 → 44.970
✅	iot_agent_suse_amd64	+28.03 KiB (0.06% increase)	44.455 → 44.482 → 44.970

9 successful checks with minimal change (< 2 KiB)

	Quality gate	Current Size
✅	docker_cluster_agent_arm64	220.634 MiB
✅	docker_cws_instrumentation_amd64	7.142 MiB
✅	docker_cws_instrumentation_arm64	6.689 MiB
✅	docker_dogstatsd_amd64	39.370 MiB
✅	docker_dogstatsd_arm64	37.565 MiB
✅	dogstatsd_deb_amd64	30.024 MiB
✅	dogstatsd_deb_arm64	28.169 MiB
✅	dogstatsd_rpm_amd64	30.024 MiB
✅	dogstatsd_suse_amd64	30.024 MiB

On-wire sizes (compressed)

	Quality gate	Change	Size (prev → curr → max)
✅	agent_deb_amd64	+59.43 KiB (0.03% increase)	175.251 → 175.309 → 179.160
✅	agent_deb_amd64_fips	+43.51 KiB (0.03% increase)	166.983 → 167.026 → 174.440
✅	agent_heroku_amd64	+10.45 KiB (0.01% increase)	74.952 → 74.963 → 80.310
✅	agent_msi	-16.0 KiB (0.01% reduction)	140.594 → 140.578 → 148.730
✅	agent_rpm_amd64	+81.28 KiB (0.04% increase)	177.285 → 177.365 → 182.080
✅	agent_rpm_amd64_fips	+44.89 KiB (0.03% increase)	168.359 → 168.403 → 174.140
✅	agent_rpm_arm64	+19.46 KiB (0.01% increase)	159.361 → 159.380 → 163.610
✅	agent_rpm_arm64_fips	+14.34 KiB (0.01% increase)	151.703 → 151.717 → 156.850
✅	agent_suse_amd64	+81.28 KiB (0.04% increase)	177.285 → 177.365 → 182.080
✅	agent_suse_amd64_fips	+44.89 KiB (0.03% increase)	168.359 → 168.403 → 174.140
✅	agent_suse_arm64	+19.46 KiB (0.01% increase)	159.361 → 159.380 → 163.610
✅	agent_suse_arm64_fips	+14.34 KiB (0.01% increase)	151.703 → 151.717 → 156.850
✅	docker_agent_amd64	+27.76 KiB (0.01% increase)	267.680 → 267.707 → 272.990
✅	docker_agent_arm64	+17.08 KiB (0.01% increase)	254.704 → 254.720 → 261.470
✅	docker_agent_jmx_amd64	+50.99 KiB (0.01% increase)	336.309 → 336.358 → 341.610
✅	docker_agent_jmx_arm64	-5.01 KiB (0.00% reduction)	319.361 → 319.357 → 326.050
✅	docker_cluster_agent_amd64	neutral	72.413 MiB → 73.460
✅	docker_cluster_agent_arm64	+8.87 KiB (0.01% increase)	67.866 → 67.875 → 68.680
✅	docker_cws_instrumentation_amd64	neutral	2.999 MiB → 3.330
✅	docker_cws_instrumentation_arm64	neutral	2.729 MiB → 3.090
✅	docker_host_profiler_amd64	+15.6 KiB (0.01% increase)	110.742 → 110.757 → 125.600
✅	docker_host_profiler_arm64	+8.1 KiB (0.01% increase)	105.070 → 105.078 → 120.000
✅	docker_dogstatsd_amd64	neutral	15.238 MiB → 15.870
✅	docker_dogstatsd_arm64	neutral	14.554 MiB → 14.890
✅	dogstatsd_deb_amd64	neutral	7.941 MiB → 8.830
✅	dogstatsd_deb_arm64	neutral	6.826 MiB → 7.750
✅	dogstatsd_rpm_amd64	neutral	7.954 MiB → 8.840
✅	dogstatsd_suse_amd64	neutral	7.954 MiB → 8.840
✅	iot_agent_deb_amd64	+7.34 KiB (0.06% increase)	11.702 → 11.709 → 13.210
✅	iot_agent_deb_arm64	+6.32 KiB (0.06% increase)	9.995 → 10.001 → 11.620
✅	iot_agent_deb_armhf	+4.33 KiB (0.04% increase)	10.204 → 10.208 → 11.780
✅	iot_agent_rpm_amd64	+7.51 KiB (0.06% increase)	11.717 → 11.725 → 13.230
✅	iot_agent_suse_amd64	+7.51 KiB (0.06% increase)	11.717 → 11.725 → 13.230

cit-pr-commenter-54b7da · 2026-04-30T17:03:58Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 51b861f0-7cbc-4885-80ce-9af0ac915eed

Baseline: 80e785f
Comparison: 15e6784
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	docker_containers_cpu	% cpu utilization	-0.20	[-3.13, +2.73]	1	Logs

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	quality_gate_logs	% cpu utilization	+0.70	[-0.28, +1.68]	1	Logs bounds checks dashboard
➖	tcp_syslog_to_blackhole	ingress throughput	+0.65	[+0.47, +0.83]	1	Logs
➖	otlp_ingest_logs	memory utilization	+0.47	[+0.37, +0.57]	1	Logs
➖	ddot_metrics_sum_cumulative	memory utilization	+0.27	[+0.11, +0.43]	1	Logs
➖	ddot_metrics_sum_delta	memory utilization	+0.14	[-0.05, +0.33]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.03	[-0.50, +0.56]	1	Logs
➖	docker_containers_memory	memory utilization	+0.02	[-0.08, +0.12]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	+0.01	[-0.41, +0.44]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.01	[-0.40, +0.41]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.20, +0.19]	1	Logs
➖	uds_dogstatsd_to_api_v3	ingress throughput	-0.01	[-0.20, +0.19]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.01	[-0.10, +0.09]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.02	[-0.16, +0.11]	1	Logs
➖	quality_gate_idle	memory utilization	-0.06	[-0.11, -0.02]	1	Logs bounds checks dashboard
➖	ddot_metrics	memory utilization	-0.10	[-0.30, +0.09]	1	Logs
➖	otlp_ingest_metrics	memory utilization	-0.10	[-0.27, +0.06]	1	Logs
➖	ddot_metrics_sum_cumulativetodelta_exporter	memory utilization	-0.17	[-0.41, +0.06]	1	Logs
➖	uds_dogstatsd_20mb_12k_contexts_20_senders	memory utilization	-0.18	[-0.23, -0.13]	1	Logs
➖	docker_containers_cpu	% cpu utilization	-0.20	[-3.13, +2.73]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	-0.34	[-0.38, -0.30]	1	Logs bounds checks dashboard
➖	ddot_logs	memory utilization	-0.76	[-0.82, -0.70]	1	Logs
➖	file_tree	memory utilization	-0.78	[-0.83, -0.74]	1	Logs
➖	quality_gate_metrics_logs	memory utilization	-1.23	[-1.47, -0.98]	1	Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	observed_value	links
✅	docker_containers_cpu	simple_check_run	10/10	681 ≥ 26
✅	docker_containers_memory	memory_usage	10/10	244.51MiB ≤ 370MiB
✅	docker_containers_memory	simple_check_run	10/10	715 ≥ 26
✅	file_to_blackhole_0ms_latency	memory_usage	10/10	0.16GiB ≤ 1.20GiB
✅	file_to_blackhole_0ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10	0.21GiB ≤ 1.20GiB
✅	file_to_blackhole_1000ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_100ms_latency	memory_usage	10/10	0.17GiB ≤ 1.20GiB
✅	file_to_blackhole_100ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_500ms_latency	memory_usage	10/10	0.19GiB ≤ 1.20GiB
✅	file_to_blackhole_500ms_latency	missed_bytes	10/10	0B = 0B
✅	quality_gate_idle	intake_connections	10/10	3 ≤ 4	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	139.91MiB ≤ 147MiB	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	3 ≤ 4	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	467.72MiB ≤ 495MiB	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10	4 ≤ 6	bounds checks dashboard
✅	quality_gate_logs	memory_usage	10/10	175.72MiB ≤ 195MiB	bounds checks dashboard
✅	quality_gate_logs	missed_bytes	10/10	0B = 0B	bounds checks dashboard
✅	quality_gate_metrics_logs	cpu_usage	10/10	349.22 ≤ 2000	bounds checks dashboard
✅	quality_gate_metrics_logs	intake_connections	10/10	3 ≤ 6	bounds checks dashboard
✅	quality_gate_metrics_logs	memory_usage	10/10	370.03MiB ≤ 430MiB	bounds checks dashboard
✅	quality_gate_metrics_logs	missed_bytes	10/10	0B = 0B	bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
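
The three criteria above combine into a single predicate. The sketch below is an illustration of that rule, not the detector's own code; the function and parameter names are hypothetical.

```python
def is_regression(delta_mean_pct, ci_low, ci_high, erratic=False, tolerance=5.0):
    """Apply the three regression criteria to one experiment."""
    big_enough = abs(delta_mean_pct) >= tolerance   # |Δ mean %| ≥ 5.00%
    excludes_zero = ci_low > 0 or ci_high < 0       # CI does not contain zero
    return big_enough and excludes_zero and not erratic
```

For example, quality_gate_metrics_logs (-1.23, CI [-1.47, -0.98]) has a confidence interval that excludes zero but falls below the 5% tolerance, so it is not flagged.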

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.

vitkyrka · 2026-05-05T13:16:59Z

Reopening as a draft from the renamed branch vitkyrka/disco-autoconfig — see successor PR. The work is unchanged; GitHub doesn't allow changing the head branch of an existing PR.

For the advanced auto-config experiment. New optional field on integration.Config, populated by the auto_conf_discovery.yaml provider in a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Recognise the discovery: block in the file format and populate integration.Config.Discovery. The file is picked up via the existing .yaml extension matcher; only the configFormat struct gains a new field and GetIntegrationConfigFromFile copies it into the returned integration.Config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Hints first (when exposed), then remaining exposed ports in declared order. Dedup-aware. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per-(serviceID, configHash) cache. Successes never expire; failures expire after caller-supplied TTL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

HTTP-GET each candidate port + path with a 500ms per-probe budget and a 2s overall budget. Verify Content-Type is text/plain or application/openmetrics-text and that the body's first non-comment line is a Prometheus exposition line. Cache success/failure per (serviceID, config hash). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
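
The response validation described in this commit (accepted Content-Type, then the first non-comment line) can be approximated as below. This is a loose sketch, not the removed Go prober and not the is_prometheus_exposition helper: looks_like_exposition is a hypothetical name, the regular expression does not handle "}" inside quoted label values, and the sample value is not parsed as a number.

```python
import re

_SAMPLE_LINE = re.compile(
    r"^[a-zA-Z_:][a-zA-Z0-9_:]*"   # metric name
    r"(\{[^}]*\})?"                # optional label set
    r"\s+\S+"                      # sample value (not validated further)
    r"(\s+-?\d+)?\s*$"             # optional timestamp
)
_ACCEPTED_TYPES = ("text/plain", "application/openmetrics-text")


def looks_like_exposition(content_type: str, body: str) -> bool:
    """Check the media type, then the first non-comment, non-blank line."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in _ACCEPTED_TYPES:
        return False
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        # Only the first sample line decides the outcome.
        return _SAMPLE_LINE.match(stripped) is not None
    return False
```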

Tiny shim so %%discovered_port%% resolution can flow through the existing GetExtraConfig path; no resolver signature change required. Also tightens fakeService.GetExtraConfig in the prober tests to error on unknown keys (matches the contract of real Service impls). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Routes via Resolvable.GetExtraConfig("discovered_port"). Populated by autodiscovery/discovery's serviceWithProbeResult wrapper after a successful probe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When a Config has Discovery set, run the OpenMetrics prober against the matched Service before configresolver.Resolve. On match wrap the service so %%discovered_port%% resolves; on no match skip scheduling the check (logged at DEBUG). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SubstituteTemplateEnvVars is called at config-load time with a nil service. Without a nil check, GetDiscoveredPort panicked on res.GetExtraConfig. Match the pattern used by GetPort/GetPid/GetHostname: return a NoResolverError early when res is nil so the caller can ignore it (config_reader.go:517 already does). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…plan Cross-language plan (Go + C++ + Python) for the Agent-side infrastructure that calls a Python discover() classmethod via rtloader, replacing the existing krakend-experiment Go prober and %%discovered_port%% template var. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

autoconfig.go calls discoverer.NewPythonBridge() unconditionally; without this stub the symbol is undefined in builds where the python tag is absent (e.g. cluster agent). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…config warn

Records the exact build + bind-mount sequence that successfully validates the Plan B implementation against a real krakend container. Includes the pitfalls hit during the manual run (Python ABI mismatch, RUNPATH/RPATH bind mounts, conf.d vs data/ confusion, Python init race) so an automated harness can avoid each one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The previous commit accidentally added "py" to ruff's exclude list to work around a pre-commit hook failure on a transient local working-tree directory. The directory is gone; revert the config change.

Surfaces ErrPythonNotReady from the Python bridge when rtloader has not yet initialised, and skips the negative cache for that error so the next AD reconcile event re-attempts the probe. Fixes a startup race where AD reconciles before Python init completes (~30s gap), caches the failure, and never re-probes in stable conditions — the krakend e2e smoke test previously had to bounce the target container to clear the cache. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves the AD-vs-Python-init startup race for advanced auto-config templates. Previously, AutoDiscovery's first reconcile fired before rtloader.Initialize completed; the discoverer returned ErrPythonNotReady (uncached after the previous fix) and no future event triggered a retry in stable conditions, so the integration's check was never scheduled without manually bouncing the target container.

- pkg/collector/python: signalPythonReady closes a once-channel at the end of Initialize; WaitReady blocks on it.
- discoverer.WaitForPython is the public entry point (with a no-op stub for builds without the python tag, so cluster-agent compiles cleanly).
- configmgr.rescanDiscoveryTemplates iterates active services with Discovery templates and re-runs reconcileService for each.
- AutoConfig.start launches a fire-and-forget goroutine that waits for Python to be ready and then runs the rescan.

The bridge MUST NOT block on Python init in the AD reconcile path: fx hooks are sequential and that would deadlock against the very hook that triggers Initialize.

Verified end-to-end against the krakend tests/docker compose: krakend check is now scheduled ~9 s after agent start without any manual container bounce, sourcing http://<container-ip>:9090/metrics from the Python discover() result.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the manual krakend-bounce step now that AutoConfig automatically re-reconciles services with discovery templates once Python is ready. Adds a note on the "skipped — python not yet ready" startup log being expected and benign, plus the dev/lib rtloader restore step (needed after every agent rebuild because cmake links against host Python 3.12). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`dda inv agent.build` re-links rtloader against the host's python3.X-dev headers and overwrites the bazel-built .so files in dev/lib/. The resulting agent fails inside the discovery-dev image with `libpython3.12.so.1.0: cannot open shared object file` because the container ships Python 3.13. Detect this by extracting the libpython version the rtloader is linked against and confirming the matching libpython exists in dev/embedded/lib/ (where bazel installs it). Fail with the exact remediation commands instead of letting the user discover the issue inside the running agent container.

This reverts commit 7a95910. The rescan-on-Python-ready mechanism is being replaced by an in-bridge lazy InitPython that mirrors the python check loader's existing convention (loader.go: pythonOnce.Do(InitPython) when python_lazy_loading is true). The lazy-init shape is simpler, also fixes the CLI agent check subcommand (which hits the same race in a fresh process), and removes ~111 lines of one-shot recovery plumbing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the python check loader convention (loader.go: pythonOnce.Do + InitPython when python_lazy_loading is true). The discoverer is just another consumer that needs Python; it runs init on demand if no earlier consumer has done so. This fixes the AD-vs-Python startup race for both the agent runtime path AND the CLI 'agent check' subcommand. The previous rescan-on-ready approach handled only the running-agent case (a fresh process re-runs discovery from scratch and never gets a future event to trigger the rescan). The pythonOnce sync.Once shared with the loader makes init idempotent across all callers. python_lazy_loading defaults to true; in eager mode the collector still inits Python in its constructor and the discoverer's check is a no-op. Verified end-to-end against the krakend tests/docker compose: no "skipped — python not yet ready" log, single straight-through "Initializing rtloader" triggered by the discoverer ~6 s after agent start, krakend check [OK] with 84 metrics/run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the "skipped — python not yet ready" log discussion and the rescan-goroutine description in favour of the new straight-through lazy-init path: the discoverer triggers InitPython via pythonOnce, and the krakend check appears [OK] within ~10 s of agent start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vitkyrka mentioned this pull request Apr 30, 2026

advanced auto-config: Python discover() bridge + 9 OpenMetrics integrations DataDog/integrations-core#23547

Draft

4 tasks

dd-octo-sts Bot added the internal, team/container-platform, and team/agent-log-pipelines labels Apr 30, 2026

github-actions Bot added the long review label Apr 30, 2026


vitkyrka force-pushed the vitkyrka/advanced-autoconfig-krakend branch from 6f34723 to 15e6784 May 4, 2026 16:05

dd-octo-sts Bot added team/agent-devx team/agent-runtimes labels May 4, 2026

vitkyrka changed the title ~~autodiscovery: declarative discovery probes (KrakenD experiment)~~ autodiscovery: advanced auto-config discovery via Python discover() bridge May 5, 2026

vitkyrka closed this May 5, 2026

vitkyrka and others added 14 commits May 6, 2026 04:05

autodiscovery/discovery: scaffold package with ProbeResult/Prober types

2966db6

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

autodiscovery/discovery: candidate port ordering

6cf7d61

Hints first (when exposed), then remaining exposed ports in declared order. Dedup-aware. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
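
The ordering rule in this commit message can be sketched in a few lines. This is an illustration of the stated rule, not the Go code from the commit (which a later commit in this PR removes); candidate_ports and its parameters are hypothetical names.

```python
def candidate_ports(hints, exposed):
    """Order probe candidates: exposed hints first, then the other exposed ports."""
    exposed_set = set(exposed)
    ordered, seen = [], set()
    for port in list(hints) + list(exposed):
        # Skip hints that are not actually exposed, and skip duplicates.
        if port in exposed_set and port not in seen:
            seen.add(port)
            ordered.append(port)
    return ordered
```

With hints [9090, 1234] and exposed ports [8080, 9090, 9091], the unexposed hint 1234 is dropped and the result is [9090, 8080, 9091].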

autodiscovery/discovery: TTL probe cache

024f26f

Per-(serviceID, configHash) cache. Successes never expire; failures expire after caller-supplied TTL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tmplvar: add %%discovered_port%% template variable

a105959

Routes via Resolvable.GetExtraConfig("discovered_port"). Populated by autodiscovery/discovery's serviceWithProbeResult wrapper after a successful probe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

autodiscovery/discoverer: add types and interfaces

4675236

autodiscovery/discoverer: add cache keyed by (svc, integration)

a811370

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

autodiscovery/discoverer: add orchestrator with bridge interface

0d51325

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

vitkyrka and others added 17 commits May 6, 2026 04:05

rtloader: add runDiscover virtual for advanced auto-config

cdc1322

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

autodiscovery/discoverer: add cgo Python bridge

89c593c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

autodiscovery: replace prober with discoverer in configmgr

10cceac

autodiscovery: drop old Go prober and %%discovered_port%%

f97fcaa

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

autodiscovery: fix discovery yaml parsing, digest comment, and multi-…

aba98dc

…config warn

test: add discovery-dev Dockerfile for krakend discovery e2e

a0301ee

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tasks: add discovery-dev.build-image

6150c5b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

tasks: drop unintended pyproject.toml exclude

b9dc03d

The previous commit accidentally added "py" to ruff's exclude list to work around a pre-commit hook failure on a transient local working-tree directory. The directory is gone; revert the config change.

vitkyrka wants to merge 31 commits into main from vitkyrka/advanced-autoconfig-krakend
